Supervised Lexical Acquisition for Persian from a Web Corpus
Abstract
This paper reports on the compilation of a large Persian Web corpus and the cyclic supervised development of a lexicon and lemmatizer. We discuss the strategies adopted in compiling the corpus as well as some of the challenges in processing and tokenizing it. We also present the word patterns developed for the lemmatizer and the algorithms designed for the supervised lexical acquisition.
Related resources
Towards Semi Automatic Construction of a Lexical Ontology for Persian
Lexical ontologies and semantic lexicons are important resources in natural language processing. They are used in various tasks and applications, especially where semantic processing is involved, such as question answering, machine translation, text understanding, information retrieval and extraction, content management, text summarization, knowledge acquisition and semantic search engines. Altho...
Deep Lexical Acquisition of Type Properties in Low-resource Languages: A Case Study in Wambaya
We present a case study on applying common methods for the prediction of lexical properties to a low-resource language, namely Wambaya. Leveraging a small corpus leads to a typical high-precision, low-recall system; using the Web as a corpus has no utility for this language, but a machine learning approach seems to utilise the available resources most effectively. This motivates a semi-supervis...
Unsupervised WSD based on Automatically Retrieved Examples: The Importance of Bias
This paper explores the large-scale acquisition of sense-tagged examples for Word Sense Disambiguation (WSD). We have applied the “WordNet monosemous relatives” method to automatically construct a web corpus, which we have used to train disambiguation systems. The corpus-building process has highlighted important factors, such as the distribution of senses (bias). The corpus has been used to trai...
Extracting Lexico-conceptual Knowledge for Developing Persian WordNet
Semantic lexicons and lexical ontologies are major resources in natural language processing. Developing such resources is a time-consuming task, for which some automatic methods have been proposed. This paper describes some methods used in the semi-automatic development of FarsNet, a lexical ontology for the Persian language. FarsNet includes the Persian WordNet with more than 10000 synsets of nouns,...
Persian Wordnet Construction using Supervised Learning
This paper presents an automated supervised method for Persian wordnet construction. Using a Persian corpus and a bilingual dictionary, initial links between Persian words and Princeton WordNet synsets are generated. These links are later classified as correct or incorrect by a trained classification system that employs seven features. The whole method is just a classification...